Learning to summarize from human feedback
In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences.
So instead of scoring summaries with existing metrics like ROUGE, they evaluate them with a reward model? (to read later)
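The reward-model idea in the paper boils down to a pairwise preference loss: given two candidate summaries, the model is trained so the human-preferred one gets the higher score. A minimal sketch of that loss (function name and values are illustrative, not from the paper):

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected)).

    Small when the human-preferred summary's reward exceeds the
    rejected summary's reward, large when the ordering is wrong.
    """
    diff = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))
```

When the two rewards are equal the loss is log 2; it shrinks as the preferred summary pulls ahead.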
Figure 5
Reward Hacking
Addressed with a KL penalty (keeps the RL policy close to the supervised policy)
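The KL penalty modifies the reward the RL policy optimizes: the learned reward minus a term penalizing divergence from the supervised (SFT) policy, R = r(x, y) − β · log(π_RL(y|x) / π_SFT(y|x)). A minimal sketch (the β value here is illustrative, not the paper's setting):

```python
def kl_penalized_reward(reward, logprob_rl, logprob_sft, beta=0.05):
    """KL-penalized reward used to discourage reward hacking.

    reward      -- score from the reward model, r(x, y)
    logprob_rl  -- log pi_RL(y|x) under the policy being trained
    logprob_sft -- log pi_SFT(y|x) under the supervised baseline
    beta        -- penalty coefficient (illustrative value)
    """
    # Subtracting beta * (log pi_RL - log pi_SFT) makes drifting far
    # from the SFT policy costly, even if the reward model likes it.
    return reward - beta * (logprob_rl - logprob_sft)
```

If the policy has not drifted (equal log-probs), the penalty is zero; the more probability the RL policy puts on text the SFT policy finds unlikely, the more the reward is cut.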